Statistical Disclosure Control

Contact:

Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011

Methodology testing (WP 5)

Leading partner: IStat

Participating partners: IStat, UniMan, StBa, ONS

Objectives

This workpackage aims at verifying the effectiveness and applicability of the statistical disclosure control techniques proposed in WP1.1, WP1.2 and WP3, in particular those implemented in Argus.
There are two aspects to assessing the effectiveness of SDC methods for microdata:
- assessment of the extent to which identification is impeded;
- evaluation of the analytical validity of protected data.
Concerning tabular data protection, the following aspects will be considered:
- information loss due to secondary cell suppression (number of suppressions, value of suppressions);
- effectiveness of facilities for control of the selection of secondary suppressions;
- computing resources needed considering size and complexity of structure of the applications.
Application to some prototypical data sets produced by surveys having a Europe-wide perspective will be performed. Technical applicability will be checked and comparisons between different methods will be carried out.

Task TM1 (Responsible UniMan)

Objectives

This builds on earlier work, which attempted to match household and census records and which yielded very valuable results, in particular highlighting the protection which arises from highly correlated categorical variables. We propose to extend this in the context of the General Household Survey, starting from two or three defined scenarios to define key variables and then assessing the availability of matching data across a range of European countries. Data sources for matching attempts will include occupational registers, electoral registers, GP lists and housing information. With appropriate collaboration this could be extended to use government or business files for matching. If the General Household Survey data turn out not to be available, another representative survey will be chosen.

Description of the work

Using the General Household Survey we will attempt to mimic an intruder seeking to establish identification. This work will lead to an assessment of the degree to which identification is impeded by the application of the disclosure control methods implemented in ARGUS. The work would need NSI resources for validating matches/identification and close collaboration with an NSI over appropriate data sources to use. The result would give an estimate of success given a high level of resource input (e.g. 1 year of research time, advanced computing, etc.). It will highlight the weakest points in protection and thus indicate the particularly risky variables or combinations of variables. It will also indicate the increased difficulty of identification after disclosure control methods have been applied. The work will also employ the new "Data Intrusion Simulation" method recently developed under funding from the UK Economic and Social Research Council and the US Bureau of the Census, which provides estimates of the probabilities of correct matching against a given target file.
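As a rough illustration of the matching idea, the sketch below counts released records whose combination of key variables is unique both in the released file and in an external identification file. This is not the "Data Intrusion Simulation" method, and the function names, key variables and toy records are all assumptions made for the example. A unique match is only a candidate identification and would still need validation.

```python
from collections import Counter

KEY_VARS = ("age", "sex", "occupation")  # hypothetical key variables


def unique_matches(released, register, key_vars=KEY_VARS):
    """Count released records whose key is unique in both files."""
    def key(record):
        return tuple(record[v] for v in key_vars)

    released_counts = Counter(key(r) for r in released)
    register_counts = Counter(key(r) for r in register)
    return sum(
        1
        for r in released
        if released_counts[key(r)] == 1 and register_counts[key(r)] == 1
    )


def recode_age(records):
    """Global recoding: replace exact age by the start of its ten-year band."""
    return [{**r, "age": r["age"] // 10 * 10} for r in records]


# Toy data, invented for illustration only.
released = [
    {"age": 34, "sex": "F", "occupation": "doctor"},
    {"age": 35, "sex": "F", "occupation": "doctor"},
    {"age": 52, "sex": "M", "occupation": "farmer"},
]
register = released + [{"age": 53, "sex": "M", "occupation": "farmer"}]

before = unique_matches(released, register)
after = unique_matches(recode_age(released), recode_age(register))
print(before, after)
```

Comparing the count before and after recoding gives a crude indication of how much a protection step impedes matching on the chosen key variables.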

Milestones and expected result

Report on an evaluation of the availability of data sources which could be used for identification purposes, after 12 months.
Report on an assessment of the extent to which identification is impeded by the application of disclosure control methods in ARGUS, after 24 months.

Task TM2 (Responsible UniMan)

Objectives

Much of disclosure risk research focuses on the control side of the disclosure issue, asking: "what do we need to do in order to make this data safe?" However, this question is only one side of the problem that a data provider faces in controlling for risk. All risk control methods degrade the data to some extent and therefore reduce the ability of data users to conduct the analyses they need for their legitimate purposes.
These effects fall into two categories:
- Reduction of analytical completeness. Some control methods, typically the recoding of taxonomic schemes into coarser categorisations, mean that analyses that could have been conducted with unrecoded data cannot be done. An example is the use of geographical thresholds in microdata sets leading to smaller administration units being grouped together, preventing researchers within those units from effectively using the dataset.
- Loss of analytical validity. The loss of analytical validity is harder to define, but in some ways more critical because of its insidious nature. Technically, loss of validity can be said to occur when a disclosure control method has changed a dataset to the point where a user reaches a different conclusion from the same analysis.
Discussion of these two issues is at present pre-theoretical. No principled (or even ad hoc) computational method has been established for the practical assessment of their impact. However, the development of such a method is vital to improving the efficiency of disclosure control techniques, which are at present haphazard in respect of their analytical consequences. This section of the proposed work aims to redress this lack by categorising the effects on analytical power of the full range of disclosure control techniques and by examining the feasibility of developing methods for measuring the scale of such effects.

Description of the work

To turn this complex issue into a tractable problem, the work will focus on datasets available from the 1991 UK census. This will enable the researcher to build on work conducted in preparation for the 2001 census surveying the uses made of UK census microdata, as well as four years of work analysing disclosure risk with such data. The work will allow an empirical investigation of the feasibility of assessing the impact of disclosure control techniques on analytical power:
* A prototypical set of analyses will be constructed through literature review and through user surveys. These will attempt to cover work that could be done on unprotected or less protected data, as well as work that has already been done on released data.
* The analysis set will be applied to raw data and data protected by various SDC methods (in particular those used by ARGUS). The purpose of this stage is to establish where identifiable conclusions could be affected by SDC methods either in terms of validity or completeness.
* The feasibility of generating information metrics will then be examined. These could indicate, for example, the accurate differentiability of records in the dataset. If feasible, these metrics will then be applied to the dataset, before and after the application of SDC methods (as in the second step above). The results of this step will then be calibrated against those of the second step to assess the possibility of providing a general measure of analytical power for data for which criterion values for veridical data usability could be generated.

Milestones and expected results

The generation of a prototypical set of data analyses for 1991 UK census data (8 months).
The categorisation of the effect on those analyses of the application of SDC methods. (16 months)
A conclusion on the plausibility of generalised metrics of analytical power (24 months)
A detailed empirical study of the effect of applying SDC techniques on UK census data with conclusions generalisable to other data and a positive statement on the plausibility of generalised metrics for measuring such effects.

Task TM3 (Responsible StBa)

Objectives

Testing the general applicability of the masking algorithm developed as one of the tasks of WP 1.1 by applying it to other business data.

Description of the work

In order to test the general applicability of the masking algorithm, tests with data sets different from those used during development will be performed. For example, the general masking procedure developed in task WP 1.1 will be applied to a complex business panel survey (defining subsets, masking the subsets, evaluating analytical validity and sufficiency of the mask). The results of these tests will highlight the impact of the specifics of particular data sets on the use of a masking algorithm and allow some conclusions concerning the general option of disseminating these particular business data.

Milestones and expected result

Establish the scope for effective application of masking techniques for anonymising business data.

Task TM4 (Responsible StBa)

Objectives

The aim of this task is to compare the results of the masking to those of other techniques, especially sophisticated micro-aggregation. On the basis of these experiences a strategy for the dissemination of business data will be developed, which will probably mix several techniques.

Description of the work

Within this task empirical comparisons of different methods as developed and proposed in WP 1.1 will be performed. To this end, at least two different subsets of the data will be masked applying both Sullivan’s approach and microaggregation techniques (cf. WP 1.1). The comparisons will consider three different aspects: technical applicability, effectiveness of the method (protection level), and analytical validity of the perturbed data. Results will be analysed in the light of theory. Based on the results of the comparisons a strategy for disseminating cross-sectional business data will be proposed, which will probably mix various techniques, combining not only the perturbation techniques as developed in WP 1.1, but also well-known non-perturbative techniques such as subsampling, eliminating highly endangered subpopulations, recoding, and global and local suppression.
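For readers unfamiliar with microaggregation, the following minimal sketch shows the simplest univariate, fixed-size variant: records are sorted on one variable, grouped into sets of at least k, and each value is replaced by its group mean. It is an assumption-laden illustration only; the methods developed in WP 1.1 may differ substantially, and the function name and toy values are hypothetical.

```python
def microaggregate(values, k=3):
    """Fixed-size univariate microaggregation: replace values by group means."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the records, ordered by the value being protected.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Consecutive groups of k records in sorted order.
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    # Merge a short final group into its predecessor so each group has >= k.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


# Toy turnover figures, invented for illustration only.
turnover = [10.0, 200.0, 12.0, 11.0, 190.0, 210.0, 500.0]
masked = microaggregate(turnover, k=3)
print(masked)
```

Because each value is replaced by a group mean, the overall total is preserved, while the large outlier (500.0) is pulled towards its neighbours; this is the kind of trade-off between protection level and analytical validity that the comparisons consider.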

Milestones and expected result

A strategy for anonymising cross sectional business data, suggesting a mix of various perturbative and non-perturbative techniques.

Task TM5 (Responsible IStat)

Objectives

To test the effectiveness of the methods proposed in Task T1 of WP 1.1 by means of application on real data.

Description of the work

The proposed approach will be tested on a business survey: the Community Innovation Survey (CIS). The CIS involves a mixture of continuous and categorical variables and poses considerable confidentiality problems. Statistical analyses will be carried out in order to evaluate possible distortions resulting from the proposed methodologies. We will also assess how much disclosure protection the methods achieve by applying matching algorithms.

Milestones and expected result

We expect to define a strategy for the creation of a microdata file for research from business survey data.

Task TM6 (Responsible StBa)

Objectives

The task aims at evaluating the effectiveness and applicability of the methodology for tabular data protection as proposed in WP 3. Various data sets from economic surveys will be used (e.g. business tax statistics, structural business survey, etc.).

Description of the work

Tools and methods for secondary cell suppression as proposed and implemented in τ-ARGUS as tasks of WP 3 and WP 4.2 will be applied to single and multiple tables from various economic surveys. Information loss will be assessed by recording the number and the summed value of secondary suppressions. The effectiveness of the facilities for control of the selection of secondary suppressions will be tested, and strategies for use of these facilities will be proposed. The computing resources required, expected to depend largely on the size and structural complexity of an application, will be recorded.
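The two information-loss measures named above (number of secondary suppressions and the sum of their values) can be sketched as follows. The `Cell` type, the status labels and the toy table are assumptions for illustration and do not reflect τ-ARGUS data structures or output formats.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    value: float
    status: str  # hypothetical labels: "safe", "primary" or "secondary"


def secondary_suppression_loss(cells):
    """Return (number, summed value) of secondarily suppressed cells."""
    secondary = [c for c in cells if c.status == "secondary"]
    return len(secondary), sum(c.value for c in secondary)


# Toy table, invented for illustration only.
table = [
    Cell(120.0, "safe"),
    Cell(8.0, "primary"),
    Cell(15.0, "secondary"),
    Cell(40.0, "secondary"),
    Cell(300.0, "safe"),
]
count, total = secondary_suppression_loss(table)
print(count, total)
```

Recording both figures for each suppression pattern makes it possible to compare alternative selections of secondary suppressions on the same table.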

Milestones and expected result

Establish the scope and propose strategies for effective application of various cell suppression tools and methods as developed and implemented in τ-ARGUS as tasks of WP 3 and WP 4.1.